Performance Optimization: Optimized TileShape Configuration for f8 #3617
Conversation
- Change TileShape from 128x128x128 to 128x256x128
- Add cooperative kernel by default for f8 kernels
Thanks for the contribution, @MatrixAssembler! Wonder if you observe optimization opportunities with M <= 128 || N <= 128?
@jiawenliu64 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Indeed @jiawenliu64, the bandwidth of H100 and B100 is so large. f8f8bf16_rowwise (M = N = K = 8,192):
we can deduce that for a configuration where 128 >= M > 64 (M being the batch size), this can be a common GEMM. This is the next contribution I'm preparing; I would have liked to do the same for B100, to prepare for it.
@jiawenliu64 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
…ll be introduced in a future PR with a more efficient selection heuristic approach.
Summary: X-link: facebookresearch/FBGEMM#816
Differential Revision: D68719476
Pulled By: jiawenliu64
@jiawenliu64 merged this pull request in b9acfeb.
Summary: X-link: pytorch#3735; Pull Request resolved: facebookresearch/FBGEMM#816; X-link: pytorch#3617
Reviewed By: sunfish2010
Differential Revision: D68719476
Pulled By: jiawenliu64
fbshipit-source-id: 60705574aa1779e0171fea01addf8f20788c4749
Performance Issue with Current F8 TileShape Configuration
The current FBGEMM f8 kernels use a TileShape configuration of 128x128x128,
while the optimal shape for the dense f8 tensor core on H100 is m64n256k32.
The current configuration therefore leads to suboptimal tensor-core
and bandwidth utilization.
Optimized TileShape (128x256x128) Implementation
The TileShape configuration is changed from 128x128x128 to 128x256x128 for large GEMM
operations, using a cooperative kernel to enable optimal bandwidth and tensor-core utilization.
This configuration is notably used in Flash Attention V3 for f8.
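For illustration, the sketch below shows how such a tile configuration is typically expressed with the CUTLASS 3.x CollectiveBuilder. It is a minimal example under assumed element types, layouts, cluster shape, and stage count; it is not the exact FBGEMM source.

```cpp
// Minimal CUTLASS 3.x sketch of the tile configuration discussed above.
// Element types, layouts, cluster shape, and stage count are assumptions,
// not the exact FBGEMM settings.
#include <cute/tensor.hpp>
#include <cutlass/cutlass.h>
#include <cutlass/numeric_types.h>
#include <cutlass/layout/matrix.h>
#include <cutlass/gemm/collective/collective_builder.hpp>

using namespace cute;

// Old tile: Shape<_128, _128, _128>; new tile for large GEMMs:
using TileShape    = Shape<_128, _256, _128>;  // CTA tile (M, N, K)
using ClusterShape = Shape<_2, _1, _1>;        // thread-block cluster (assumed)

using CollectiveMainloop =
    typename cutlass::gemm::collective::CollectiveBuilder<
        cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
        cutlass::float_e4m3_t, cutlass::layout::RowMajor, 16,     // A: f8
        cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, 16,  // B: f8
        float,                                                    // accumulator
        TileShape, ClusterShape,
        cutlass::gemm::collective::StageCountAuto,
        // Cooperative (warp-specialized) schedule: two consumer warpgroups
        // share one output tile, which makes the wider 128x256 tile viable.
        cutlass::gemm::KernelTmaWarpSpecializedCooperative>::CollectiveOp;
```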
Benchmark Results on H100 GPU
Benchmark configuration:
PyTorch 2.6
CUDA 12.4
CPU: AMD EPYC
GPU: NVIDIA H100
Benchmarks are configured with 30 kernel launch iterations
and averaged over 25 benchmark runs.
We used the same GEMM sizes as in the Colfax benchmarks.
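As a point of reference, a minimal CUDA timing harness following this methodology could look like the sketch below. The `run_gemm` callable is a placeholder for the FBGEMM op under test (not a real FBGEMM symbol), and TFLOPS is computed as 2*M*N*K divided by the average per-launch time.

```cpp
// Hypothetical timing harness matching the methodology above:
// `iters` kernel launches per measurement, averaged over `repeats`
// measurements; TFLOPS = 2*M*N*K / average launch time.
#include <cuda_runtime.h>
#include <functional>

double bench_tflops(const std::function<void()>& run_gemm,  // placeholder op
                    long long M, long long N, long long K,
                    int iters = 30, int repeats = 25) {
  run_gemm();  // warm-up so one-time initialization is not timed
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  double total_ms = 0.0;
  for (int r = 0; r < repeats; ++r) {
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i) {
      run_gemm();
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    total_ms += ms / iters;  // average time of a single launch
  }
  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  const double avg_s = (total_ms / repeats) / 1e3;
  const double flops = 2.0 * M * N * K;  // one multiply-add = 2 flops
  return flops / avg_s / 1e12;           // TFLOPS
}
```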
Benchmark
f8f8bf16_grouped (G = 4, M = 2,048, N = 8,192, K = 8,192)

| TileShape   | TFlops |
|-------------|--------|
| 128-128-128 | 1244   |
| 128-256-128 | 1374   |

f8f8bf16_rowwise (M = N = K = 8,192)

| TileShape   | TFlops |
|-------------|--------|
| 128-128-128 | 1300   |
| 128-256-128 | 1480   |

f8f8bf16_tensorwise (M = N = K = 8,192)

| TileShape   | TFlops |
|-------------|--------|
| 128-128-128 | 1271   |
| 128-256-128 | 1463   |
Technical Implementation
Modified TileShape from 128-128-128 to 128-256-128 for:
- f8f8bf16_grouped
- f8f8bf16_rowwise
- f8f8bf16_tensorwise

Added cooperative kernel by default for:
- f8f8bf16_rowwise
- f8f8bf16_tensorwise
f8f8f16.cu was not modified because it is deprecated in favor of f8f8bf16_tensorwise.
The modifications only affect large GEMMs where M > 128, N > 128, and M or N > 2,048.
The matrices are divided into tiles twice as large, but with kernels using 3
SMs instead of 2. Smaller shapes handled by the large-kernel heuristics may see
slightly reduced efficiency compared to the previous configuration.
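As a rough sketch of the shape-based selection described above (the thresholds mirror the ones stated; the enum and helper names are illustrative, not actual FBGEMM entry points):

```cpp
// Illustrative dispatch for the shape thresholds described above.
// Names are placeholders, not actual FBGEMM entry points.
#include <cstdint>

enum class F8Config { Large128x256x128Cooperative, Default128x128x128 };

inline F8Config select_f8_config(int64_t M, int64_t N) {
  // Use the 128x256x128 cooperative kernel only for large problems:
  // both dimensions exceed one tile and at least one dimension is > 2048.
  const bool large = (M > 128) && (N > 128) && (M > 2048 || N > 2048);
  return large ? F8Config::Large128x256x128Cooperative
               : F8Config::Default128x128x128;
}
```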
An empirical study of F8 kernel configurations across GEMM sizes could benefit FBGEMM.
These changes were made by modifying the minimum necessary code while respecting
existing coding practices in FBGEMM.
Test Coverage
Unit Tests Results
The unit tests in fbgemm_gpu/experimental/gen_ai/test/quantize
have been verified for the modified kernels.
@jiawenliu64 @jwfromm Thank you!